Project Approach, Methodology, and Rationale

To determine which NBA players are promising acquisitions for the team (that is, statistically high performing yet underpaid), a K-means clustering model is appropriate. To ensure the clustering algorithm uses the features most indicative of both salary and performance, the first step is to investigate which player statistics correlate most strongly with salary. These features are then used to build the K-means clustering model.

When building the model, it is first crucial to calculate and visualize the optimal number of clusters for describing the players. With that number of centers fixed, graphs and visualizations are then used to investigate how each cluster characterizes a player profile.

Plotting the clusters against both salary and a particularly indicative performance metric identifies players who would be valuable additions to the team and who are underpaid enough that they are likely to move to our team when offered a higher salary.

Calculating Feature Correlation with Salary

##    2020-21        Age          G         GS         MP         FG        FGA 
## 1.00000000 0.39028405 0.15046768 0.53757945 0.46486266 0.58107265 0.57338021 
##        FG%         3P        3PA        3P%         2P        2PA        2P% 
## 0.10484552 0.42788671 0.42413825 0.11162594 0.53467499 0.54463327 0.03277157 
##       eFG%         FT        FTA        FT%        ORB        DRB        TRB 
## 0.10428497 0.56782049 0.55506809 0.19096316 0.19927172 0.46545409 0.41729829 
##        AST        STL        BLK        TOV         PF        PTS 
## 0.59041905 0.44600283 0.22037462 0.57616641 0.28299940 0.59414446

This output shows the correlation of each variable in the data set with salary (the 2020-21 column). To keep only the features most explanatory of salary, the model data comprises only features whose correlation with salary is at least 0.55. PTS (points), AST (assists), TOV (turnovers), FTA (free throws attempted), FGA (field goals attempted), FT (free throws made), and FG (field goals made) are used to cluster the players. The normalized data, ready for model fitting, is shown below:

##          PTS        AST        TOV        FTA        FGA         FT         FG
## 1 0.24881292 0.23188406 0.35570470 0.22038567 0.27879581 0.16442953 0.24010554
## 2 0.24406458 0.18260870 0.19463087 0.11019284 0.33507853 0.09731544 0.25065963
## 3 0.07122507 0.02028986 0.07382550 0.03305785 0.08246073 0.02684564 0.06596306
## 4 0.11490978 0.04347826 0.08724832 0.06887052 0.11780105 0.06711409 0.11609499
## 5 0.32003799 0.24347826 0.18120805 0.04958678 0.40837696 0.04697987 0.36411609
## 6 0.04463438 0.04347826 0.06711409 0.02479339 0.05628272 0.02348993 0.04485488
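
The feature selection and scaling described above can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: the data frame `nba`, its `salary` column, and the toy values are hypothetical, and min-max normalization is an assumption inferred from the fact that the values shown above fall between 0 and 1.

```r
# Toy stand-in for the player data; the real data frame and its column names are assumptions.
set.seed(1)
n <- 50
pts <- runif(n, 0, 2000)
nba <- data.frame(
  salary = 1e6 + 15000 * pts + rnorm(n, sd = 3e6),  # salary loosely tied to points
  PTS    = pts,
  AST    = 0.25 * pts + rnorm(n, sd = 60),
  Age    = sample(19:38, n, replace = TRUE)
)

# Correlation of every numeric column with salary.
salary_cor <- cor(nba, use = "pairwise.complete.obs")[, "salary"]

# Keep features at or above the cutoff, excluding salary itself.
cutoff <- 0.55
keep <- setdiff(names(salary_cor)[salary_cor >= cutoff], "salary")

# Min-max normalization (assumed): rescale each kept feature to [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
clust_data <- as.data.frame(lapply(nba[keep], min_max))
head(clust_data)
```

Rescaling matters here because K-means relies on Euclidean distance, so a feature with a large raw range such as PTS would otherwise dominate features with small ranges such as TOV.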

Determining the Optimal Number of Clusters

Elbow Curve Method

The elbow curve suggests that three is the optimal number of clusters.
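
An elbow curve of this kind can be produced along the following lines. This is a sketch under stated assumptions: `clust_data` here is a synthetic stand-in with three loose groups, and the range of k values and the `nstart` setting are illustrative choices rather than the ones used for the report.

```r
# Synthetic stand-in for the normalized feature matrix (three loose groups).
set.seed(17)
clust_data <- rbind(
  matrix(rnorm(60, mean = 0.1, sd = 0.03), ncol = 2),
  matrix(rnorm(60, mean = 0.5, sd = 0.03), ncol = 2),
  matrix(rnorm(60, mean = 0.9, sd = 0.03), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8.
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(clust_data, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which adding clusters stops reducing WSS sharply.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```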

NbClust Method

The NbClust method recommends 2 and 3 clusters equally.

Thus, the model will be run with both two and three clusters, and the best-performing model will be used.
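
NbClust aggregates many cluster-validity indices, one of which is the average silhouette width. As a lightweight cross-check, that single index can be computed directly with the `cluster` package. Note that this is a swapped-in technique, not the NbClust procedure itself, and the data below is a synthetic stand-in rather than the player data.

```r
library(cluster)  # provides silhouette()

# Synthetic stand-in for the normalized feature matrix (three loose groups).
set.seed(17)
clust_data <- rbind(
  matrix(rnorm(60, mean = 0.1, sd = 0.03), ncol = 2),
  matrix(rnorm(60, mean = 0.5, sd = 0.03), ncol = 2),
  matrix(rnorm(60, mean = 0.9, sd = 0.03), ncol = 2)
)

# Average silhouette width for each candidate number of clusters.
d <- dist(clust_data)
candidate_k <- 2:6
avg_sil <- sapply(candidate_k, function(k) {
  fit <- kmeans(clust_data, centers = k, nstart = 10)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- candidate_k

# The k with the largest average silhouette width is the preferred choice.
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```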

Model Building and Explained Variance With 2 Clusters

# Model with 2 clusters

set.seed(17)
kmeans_2 = kmeans(clust_data, centers = 2, algorithm = "Lloyd")

# Evaluate the quality of the clustering
betweenss_2 = kmeans_2$betweenss

# Total variance: "totss" is the total sum of squares, i.e. the sum of squared distances from each point to the overall mean.
totss_2 = kmeans_2$totss

# Variance accounted for by clusters.
(var_exp_2 = betweenss_2 / totss_2)
## [1] 0.5933796

The percentage of variation that is explained by the model with two centers is 59%.

Model Building and Explained Variance With 3 Clusters

# Model with 3 clusters

set.seed(17)
kmeans_3 = kmeans(clust_data, centers = 3, algorithm = "Lloyd")

# Evaluate the quality of the clustering
betweenss_3 = kmeans_3$betweenss

# Total variance: "totss" is the total sum of squares, i.e. the sum of squared distances from each point to the overall mean.
totss_3 = kmeans_3$totss

# Variance accounted for by clusters.
(var_exp_3 = betweenss_3 / totss_3)
## [1] 0.7696085

The percentage of variation that is explained by the model with three centers is 77%.

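The model built with three centers has an explained variance of 0.77 (77%), compared with a much lower 0.59 (59%) for the model with two centers. Explained variance generally rises as clusters are added, so this comparison alone would not settle the choice; taken together with the elbow curve and the NbClust recommendation, three clusters is the optimal choice for this data.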

Visualizing Clusters

The graph above plots player salary against player points, colored by cluster. My rationale for visualizing the clusters on these axes is that points (PTS) is the statistic most indicative of a player's performance and also one of the features most highly correlated with salary. Graphing salary against points scored therefore allows us to identify players who are high performing yet underpaid.

The clusters can be interpreted as groups of NBA players with differing skill levels, and as a moderate indicator of player salary. The blue group (cluster 2) represents players with the lowest point totals, who are paid within a narrow range of low salaries. The red group (cluster 1) represents players with moderate, roughly average point totals relative to the whole population, who are paid a moderately varied range of salaries. The green group (cluster 3) represents players with the highest point totals, yet it shows the greatest salary disparity among its members. In other words, players in cluster 3 perform at relatively similar levels, yet the group has the widest salary range and the highest maximum salary in the entire population.
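
A salary-against-points plot colored by cluster can be drawn with base R roughly as follows. This is only a sketch: the `players` data frame, its columns, and its values are hypothetical, and for brevity the toy clustering uses points alone rather than the seven selected features.

```r
# Hypothetical toy data; names and values are for illustration only.
set.seed(17)
players <- data.frame(
  PTS = c(runif(20, 0, 400), runif(20, 400, 1000), runif(20, 1000, 2000))
)
players$salary <- 1e6 + 12000 * players$PTS * runif(60, 0.3, 1.5)

# Cluster on min-max scaled points only, to keep the sketch short.
pts_scaled <- (players$PTS - min(players$PTS)) / diff(range(players$PTS))
fit <- kmeans(pts_scaled, centers = 3, nstart = 10)
players$cluster <- factor(fit$cluster)

# Salary against points, one color per cluster.
plot(players$PTS, players$salary, col = players$cluster, pch = 19,
     xlab = "Points (PTS)", ylab = "Salary (USD)")
legend("topleft", legend = levels(players$cluster),
       col = seq_along(levels(players$cluster)), pch = 19, title = "Cluster")
```

Cluster numbers assigned by K-means are arbitrary labels, so the mapping of colors to low, moderate, and high scorers has to be read off the plot each time the model is refit.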

Conclusion

In determining which players offer the greatest potential payoff to the team, it is important to look at cluster 3 alone, since it is the highest-performing subset of the population. Within cluster 3, the players most likely to be tempted to move to our team are those paid at the low end of the cluster's salary range. Thus, I suggest that Zion Williamson, Luka Dončić, and Trae Young be priority recruits for our team. These players have some of the highest point totals in their group, yet are paid the lowest salaries for their level of performance. Williamson, Dončić, and Young are comparable to, and in some cases higher performing than, other players in cluster 3 who are paid almost four times as much. Moreover, these players score about 57 times as many points as other players with similar or equal salaries. Thus, from the model and the accompanying visualizations, I conclude that Williamson, Dončić, and Young, as well as players with similar profiles, are the most valuable acquisitions for our team.
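
The selection logic in this conclusion (restrict to the highest-scoring cluster, then look for the lowest pay relative to output) can be expressed as a short sketch. The player names, values, and the salary-per-point ranking below are hypothetical illustrations, not the data or the exact criterion used in the report.

```r
# Hypothetical players and values, for illustration only.
players <- data.frame(
  name    = c("A", "B", "C", "D", "E", "F"),
  PTS     = c(1900, 1850, 1800, 600, 550, 120),
  salary  = c(40e6, 9e6, 10e6, 8e6, 12e6, 2e6),
  cluster = c(3, 3, 3, 1, 1, 2)
)

# Find the cluster with the highest mean points instead of hard-coding its label,
# since K-means cluster numbers are arbitrary.
top_cluster <- names(which.max(tapply(players$PTS, players$cluster, mean)))
top <- players[players$cluster == top_cluster, ]

# Rank by salary per point: the lowest values are the best-value targets.
top$salary_per_point <- top$salary / top$PTS
targets <- top[order(top$salary_per_point), ]
head(targets, 3)
```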